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Abstract 



Many popular Bayesian Nonparametric priors can be characterized in terms of 



exchangeable species sampling sequences. However, in some applications, exchange- 
ability may not be appropriate. Wc introduce non exchangeable generalized species 
00 sampling sequences characterized by a tractable predictive probability function with 

^ weights driven by a sequence of independent Beta random variables. We compare 

^ their clustering properties with those of the Dirichlet Process and the two parameters 

Poisson-Dirichlet process. We propose the use of such sequences as prior distributions 

• ^ 

in a hierarchical Bayes modeling framework. We detail on Markov Chain Monte Carlo 

H 

^ posterior sampling and discuss the resulting inference in a simulation study, compar- 

ing their performance with that of popular Dirichlet Processes mixtures and Hidden 
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Markov Models. Finally, we discuss an application to the detection of chromosomal 
aberrations in breast cancer using array CGH data. 

AMS CLASSIFICATION : Primary 62C10; secondary 62G57 

KEYWORDS : Bayesian non-par ametrics, Species Sampling Priors, Predictive Prob- 
ability Functions, Random Partitions 

1. INTRODUCTION 

Bayesian nonparametric priors have become increasingly popular in applied statistical 
modeling in the last few years. Examples of their wide area of applications range from 
variable selection in genetics (Kim et al., 2006) to linguistics (Teh, 2006b; Wallach et al., 
2008), psychology (Navarro et al., 2006), human learning (Griffiths, 2007), image seg- 
mentation (Sudderth and Jordan, 2009) and applications to the neurosciences (Jbabdi 
et al., 2009). See also Hjort et al. (2010). The increased interest in non-parametric 
Bayesian approaches is motivated by a number of attractive inferential properties. For 
example, Bayesian NP priors are often used as flexible models to describe the het- 
erogeneity of the population of interest, as they implicitly induce a clustering of the 
observations into homogeneous groups. Such a clustering can be seen as a realization 
of a random partition scheme and can often be characterized in terms of a species 
sampling (SS) allocation rule. More formally, a SS sequence is a sequence of random 
variables Xi,X2, ■ ■ ■ , characterized by the predictive probability functions (PPF), 

n 

P{X„+ie- \Xi, . . . ,Xn} = Y,1nd^Xji-) + 1n,n+lGo{-), (1) 

where Sx{-) denotes a point mass at x, Qnj (j = 1, . . . , n-|-l) are non-negative functions 
of {Xi, . . . , Xn), such that Yll=i Qnj = 1, and Go is a non-atomic probability measure 
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(Pitman, 1996b). Collecting the unique values of Xj, (1) can be rewritten as 

K„ 

G • . . . =^q*6x;i-) + QK^+iGoi-), (2) 

where Kn is the (random) number of distinct values, say {Xf, . . . X'^^), in the vector 
X(n) = {Xi,...,Xn) and are suitable non-negative weights. In particular, an 
exchangeable SS sequence is characterized by weigths q* that depend only on n„ = 
{nin, . . . , nK„n), where rijn is the frequency of X* in X{n) (Fortini et. al, 2000; Hansen 
and Pitman, 2000; Lee et al., 2008). The most well known example of predictive rules of 
type (1) is the Blackwell MacQueen sampling rule, which implicitly defines a Dirichlet 
Process (DP, Blackwell and MacQueen, 1973; Ishwaran and Zarepour, 2003). The 
predictive rule characterizing a DP with mass parameter 6 and base measure Go{-), 
DP{d,Go), sets qn,j = ^ and g„,„+i = ^ in (1). 

Whenever the weights q*{nn) and (7|^^^^(n„) do not depend only on n„, the se- 
quence {Xi,X2, . . .) is not exchangeable. Models with non-exchangeable random par- 
titions have recently appeared in the literature, e.g. to allow for partitions that depend 
on covariates. Park and Dunson (2007) derive a generalized product partition model 
(GPPM) in which the partition process is predictor-dependent. Their GPPM general- 
izes the DP clustering mechanism to relax the exchangeability assumption through the 
incorporation of predictors, implicitly defining a generalized Polya urn scheme. Miiller 
and Quintana (2010) define a product partition model that includes a regression on 
covariates, allowing units with similar covariates to have greater probability of being 
clustered together. Arguably, the previous models provide an implicit modification of 
the predictive rule (1) where the weights can be seen as function of some external pre- 
dictor. Alternatively, other authors model the weights gj(n„) explicity, for instance, by 
specifying the weights as a function of the distance between data points (Dahl et al., 
2008; Blei and Frazier, 2011). However, the general properties of the random partitions 
generated by such processes have not been specifically addressed. 
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In this paper, we discuss a general family of non-exchangeable species sampling pro- 
cesses, where the weights are specified sequentially and do not depend on the cluster 
sizes, but instead on the realizations of a set of latent variables. We propose a simple 
characterization of the weights in the predictive probability function as a product of 
independent Beta random variables. This strategy leads to a well-defined random allo- 
cation scheme of the observables and the resulting sequence, which we call Beta-GOS 
process, is a special case of Generalized Ottawa Sequence (GOS), recently introduced 
by Bassetti, Crimaldi and Leisen (2010). We discuss the properties of the Beta-GOS 
process, with particular regard to the clustering induced on the observables. More 
specifically, we study the asymptotic distribution of the (random) number of distinct 
values in the sequence, Kn, for some natural specifications of the weights. We compare 
those results with the well-known asymptotic results characterizing the DP and the 
two-parameters Poisson Dirichlet process. 

In many applications, the NP prior is used within hierarchical models to specify the 
prior distribution of the parameters of the sampling distribution. This is the setting 
proper of widely used mixtures of Dirichlet Processes. Similarly, the Beta-GOS process 
can also be used to define a prior in a hierarchical model. We outline the basic steps 
of the MCMC sampling required for posterior inference, which are based on a Gibbs 
sampling scheme recently proposed by Blei and Frazier (2011). 

In a set of simulation examples, we compare the behavior of the Beta-GOS model 
with that of DP mixtures and Hidden Markov Models (HMM) in terms of cluster 
estimation. Our results suggest that the Beta-GOS can be seen as a robust alternative 
to the Dirichlet process when exchangeability would be hardly justified in practice, 
but still there's a need to describe the heterogeneity of our observations by virtue of 
an unsupervised clustering of the data. The Beta-GOS seems to provide a possible 
alternative also to customary HMM, especially when the number of states is unknown 
or the transition probabilities may be expected to vary with time. 

Finally, we analyze a published data set of genomic and transcriptional aberrations 
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in a sample of 145 primary breast tumors (Chin et al., 2006). Bayesian models for 
Array CGH data have been recently investigated by Guha et al. (2008), DeSantis et al. 
(2009), Du et al (2010) and Yau et al. (2011), among others. Guha et al. propose a four 
state homogenous bayesian HMM to detect copy number amplifications and deletions 
and partition tumor DNA into regions of relatively stable copy number. DeSantis 
et al. extend this approach and develop a supervised Bayesian latent class approach 
for classification of the clones that relies on a heterogenous hidden Markov model to 
account for local dependence in the intensity ratios. In a heterogeneous hidden Markov 
model, the transition probabilities between states depend on each single clone or the 
the distance between adjacent clones (Marioni et al., 2006). Du et al. propose a sticky 
Hierarchical DP-HMM (Fox et al., 2011; Teh et al., 2006a) to infer the number of states 
in an HMM, while also imposing state persistence. Yau et al. (2011) also propose a 
Nonpar ametric Bayesian HMM, but use instead a DP mixture model to model the 
likelihood in each state. With respect to those proposals, we also assume that the 
number of states is unknown, as it is typical in a bayesian nonparametric setting, 
but we don't need a parameter to explicitly account for state persistance. Indeed, the 
Beta-GOS model is "nonhomogenous" in nature, as the weights in the species sampling 
mechanisms adapt to take into account the local dependence in the clones' intensities. 
The MCMC algorithm is quite straightforward, and scales well to the large datasets 
typical of array CGH data. We show that the Beta-GOS is able to identify clones that 
have been linked to breast cancer pathophysiologies in the medical literature. 

The outline of this paper is as follows. In Section 2, we introduce the Beta-GOS 
prior. In Section 3 we discuss the general clustering properties of the Beta-GOS pro- 
cesses, and compare their behavior to the one parameter and two parameter Dirichlet 
processes. In Section 4, we present a hierarchical model where a Beta-GOS process 
prior is assumed for the parameters of the sampling distribution and discuss the MCMC 
sampling algorithm for posterior inference. In Section 5, we illustrate the simulation 
studies and in section 6 the application to chromosomal aberrations in breast cancer. 
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We conclude with some final remarks in Section 7. More technical details and proofs 
are contained in the Appendix. 

2. THE BETA-GOS PROCESS. 

As anticipated, the Beta-GOS process is defined by a modification of the predictive 
rule that characterizes the species sampling mechanism (1), where the weights are a 
product of independent Beta random variables. More in general, we start considering 
a sequence of random variables {Xn)n>i characterized by the predictive distributions 

n 

P{Xn+i G • \X{n),W{n)} = Y,Pn,,^x,{-) + rnGo{-) (3) 

where W{n) = {Wi, . . . , Wn) is a vector of independent random variables taking 
values in [0, 1], and the weights are defined by 

n n 

Pn,j:={l-W,) H Wi, rn:='[[W,. (4) 

i=j+l 2=1 

The prediction rule (3) defines a special case of a Generalized Ottawa Sequence (GOS), 
introduced in Bassetti, Crimaldi and Leisen (2010), a type of Generalized Polya Urn 
sequences where the reinforcement is randomly determined by the realizations of a 
latent process (see also Guha, 2010, for an alternative proposal). Except from a few 
special cases, the Xj's in a GOS are not exchangeable. However, it can be shown that 
these sequences maintain some properties typical of exchangeable sequences. Most no- 
tably, any GOS is conditionally identically distributed (CID), i.e. for all n > 0, the 
Xn+j^s, j > 1, are identically distributed, conditionally on (Xi, . . . , X„, Wi, . . . , Wn)- 
Hence, the Xj's are also marginally identically distributed. Note that a CID sequence 
is not necessarily stationary. If a CID sequence is also stationary then it is exchange- 
able. Finally, although no representation theorem is known for CID sequences, it can 
be shown that given any bounded and measurable function /, the predictive mean 
E[f{Xn-{-i)\Xi, ...,Xn] and the empirical mean ^ Yli^=i -^i converge to the same limit 
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as n goes to infinity. For details, see Berti, Pratelli and Rigo (2004), where CID se- 
quences have been first introduced. The predictive (3) reduces to known cases with a 
suitable choice of the latent Wns; for instance if Wn := (6 + n — l)/{9 + n), then (3) 
coincides with the Blackwell-MacQueen sampling rule characterizing a DP{6,Gq). 

In this paper, we propose iWn)n>i be a sequence of independent Beta(a„,/3„) ran- 
dom variales and we call the resulting (Xi, X2, . . .) a Beta-GOS sequence. The choice of 
Beta latent variables allows for a flexible specification of the species sampling weights, 
while retaining a simple and interpretable model together with computational sim- 
plicity (see later sections). The allocation rule can also be described in terms of a 
preferential attachment scheme, where each observation is attached to any of the pre- 
ceding by means of a "geometric-type" assignment. In this scheme, every individual 
Xi is characterized by a random weight (or "mark"), 1 — Wj. We can interpret each 
individual mark as an individual specific attractivity index, as it determines the prob- 
ability that the next observation will be clustered with Xi. More precisely, the first 
individual is assigned a random value (or "tag") Xi, according to Gq. Now, suppose we 
have Xi, . . . , X„ together with their marks up to time n, (1 — Wi, . . . , 1 — Wn)- Then, 
the (n + l)-th individual will be assigned the same tag as X„ with probability 1 — Wn] 
the probability of pairing Xn+i to Xn-i will be TV„(1 — VF„_i), and so forth. In general, 
Pn,j will be the product of the repulsions Wi for the latest n — j subjects and the jth 
attractivity 1 — Wj. Summarizing, Xn+i will result in a new tag (i.e., Xn+i ~ Go) with 
probability r„, or will be clustered together with a previously observed tag, say X^, 
with probability J2j-Xj=xi Pn,j- the next section, we discuss the clustering behavior 
induced by different specifications of the Beta weights in more detail. 

3. CLUSTERING BEHAVIOR OF THE BETA-GOS. 
The predictive rule (3) implicitly defines a random partition of the set {1, . . . ,n} into 
Kn blocks. In probability theory, Kn is also called as the length of the partition. 
Knowledge of the behavior of Kn is useful to understand the clustering structure im- 
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plied by (3). For instance, for a DP{9,Go), it's well-known that Kn/ log{n) converges 
almost surely to a constant, indeed the mass parameter 9. This asymptotic behavior is 
sometimes described as a "self-averaging" property of the partition (Aoki, M., 2008). 
From a practical point of view, since i^ri/log(n) converges to a constant, then in the 
limit Kn is essentially ^log(n); thus, for modeling purposes it suffices to consider only 
the first moment of Kn- In the case of the two parameter Poisson Dirichlet process the 
length of the partition Kn (suitably rescaled) converges instead to a random variable. 
More precisely, for a PD{a,9), with < a < 1, ^ > —a, then Kn/n^ converges a.s. 
to a strictly positive random variable (see Theorem 3.8 in Pitman, 2006). Therefore, 
the PD sequence is non self-averaging. When the limit of Kn is essentially a random 
variable, extra care is needed in the prior assessment of the parameters of the NP prior, 
since the clustering behavior is ultimately governed by the whole distribution of the 
limit random variable. For the Beta-Gos process, we focus on the following two cases: 

(i) a„ = a > and /3„ = 6 > for all n > 1; 

(ii) a„ = 6* - 1 + n (61 > 0) and /?„ > 1 for all n > 1 . 
Then, we can prove the following 

Proposition 1. Let Kn be the length of the partition induced by a Beta-Gos, with Gq 
non-atomic and Wn ~ Beta{an, (3n) > Ij- 

(a) If On = n + 9 — 1, (3n = 1, for given 9 > 0, Kn/log{n) converges in distribution 
to a Gamma{9, 1) random variable. 



(b) Ifan = n + 9- 1, /?„ = /3, (6* > 0, /3 > 1) or if a„ = o, ^„ = (a > 0, 6 > 0), 



then Kn converges almost surely to a finite random variable K^o ■ In particular, 
if Oin = a, /3n = b, then 



m 



(a)(-'') 



E[e 



-tK, 



OO 



= e 



(a + 6)(j) - (a)(j) 



m>0 j=l 
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where (t)'^"'^ = t{t + 1) . . . {t + m - 1) and 



m 




E[KUKoo - 1) . . . (i^oo - m + 1)] = m! J] 



(a + 6)0') - (a)0-)' 



The proof is detailed in the Appendix, where we also provide a general formula for 
the probability distribution, the A:-th moment and the generating function of Kn- The 
result in Proposition 1(a) represents a case of a quite natural (non exchangeable) 
partition model for which the length Kn scale as log(n) but is not self-averaging. 
When On = a, Pn = b, according to Proposition 1(b), the convergence of Kn to a finite 
random variable naturally implies the creation of a few big clusters, as n increases. 
Instead, for an = n + 9 — l,l3n = ^, the mean length of the partition depends on the 
value of 0, since a bigger number of clusters is associated on average with greater values 
of 9. However, as 9 increases so does the asymptotic variability of Kn', therefore, in 
this case, a Beta-GOS prior can be used to represent uncertainty on Kn (by the lack 
of the self averaging property of the process). By means of simulations, we have also 
confirmed that, for small values of 9, the partition of n elements is skewed, i.e. it is 
characterized by a small number of big clusters as well as a few small clusters. As 9 
increases, the sizes of the clusters decrease accordingly, the observations being grouped 
into clusters of relatively fewer elements. This is similar to what happens for the DP, 
and indeed in this case the parameter 9 could be interpreted as a mass parameter for 
the Beta-Gos. 

The parameters of the Wj's can be chosen to model the autocorrelation expected a 
priori in the dynamics of the sequence. The probability of a tie may decrease with n 
and atoms that have been observed at farthest times may have a greater probability 
to be selected if they have also been observed more recently. Such considerations may 
be helpful to guide prior assessment of the Beta hyperparameters. For given n > 1, 
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taking expectations with respect to the weights VFj's we obtain 

n „ n 



Under (a), it follows that ^[r„] = (a/(a + 6))" and S[p„,fc] = (a/(a + 6))"-^(6/(a + 6)); 
hence, the probabilities of ties depend only the lag n — k and decrease exponentially 
as a function oi n — k. Under (b), 

F\ 1 = 17 ■^' + ^-^ ^ T{e + n)T{9 + /3) 
1=1 i + ^ - 1 + /3 ne + p + n)T{B) 

1 _ /3 A J + g-1 _ r(g + n)r(g-l + /3 + fc) 

^LPn,fcJ - -/5 + /3 + n)r(0 + A;) ^^-i^-.-^ri. 

Thus, for n,k ^ +00, £'[r„] ~ ^ and E[pn^k] ~ For example, if ^ = 1 and /3 = 2, 

then = j and /3j = 2 and ^[r„] = (i+„)^(2+„) , ^[Pn,fc] = („ff)|„+2) ' A: = 1, . . . , n, so 
that the weights decrease linearly as a function of the lag n — k. If Uj = 9 — 1+j (9 > 0) 
and /3j = 1 then -E[r„] = and = Q^,k = 1,...,?7,, i.e. any observation 

has the same weight. This latter specification leads to an expression similar to that in 
the Blackwell-McQueen Polya Urn characterization of the Dirichlet process; however, 
this identity is true only in expectation, and the clustering behavior of the DP and 
Beta-GOS prior with Uj = 9 — 1 + j and /3j = 1 may be quite different, as it is evident 
from Proposition 1. 

In practice, the determination of the parameters of the Beta distributions is not 
trivial, and may be problem dependent. For example, if we want to represent a short 
memory process, constant a and /3 may be preferred. Since E{Kn) = 1 + Yl^=i -^kn], 
we could elicit the parameters on the basis of the expected number of clusters E{Kn) = 
^^(1 — ), or we may impose that with high probability (say, 7) we expect 

that the next observation will be chosen among the last J, i.e. X]/=o -^(^'"."-i) — ^• 
Further refinements may lead to consider second moments, or imposing constraints on 
the expected autocorrelation of the sequence. Instead, when a„ = n + — 1,/3„ = 1 
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then E{Kn) = Yl^=o e+j ~ ^log(n) for large n; hence, such a choice may be adequate 
to model a long memory process. Given the similarity with the DP and its dependence 
on a single parameter, this parameterization may be considered the default choice in 
many applications. 

Finally, we note the functional form of (4) may initially suggest a relationship 
with the stick-breaking characterization of the Dirichlet process. However, the stick- 
breaking construction characterizes the representation of the DP as a random measure, 
not the corresponding PPF. Furthermore, the sequence generated by a DP is exchange- 
able, whereas a Beta-GOS in general is not and includes the DP as a special case. As 
a matter of fact, if we would like to stress the "stick-breaking" analogy anyway, we 
should more properly interpret the PPF (3) in terms of an inverse stick-breaking, since 
each pnj, which defines the probability of a tie, say Xn+i = Xj, does not depend on 
the Wj's observed before time j, j = 1, . . . ,n, whereas the probability of choosing a 
new tag depends only on the part of the stick that is left at time n. This is evident 
if we consider the alternative characterization of (3) with pnj = WjYYi^j_^_^{l — Wi) 
and rn{Wi, . . . ,Wn) = nr=i(-'- ~ ~ Beta(aj,/3j) and choose a„ = 1 and 

/3„ = as in the DP. Then, p^j = Wj nr=j+i(l " ^0, j = 1, • • • ,n. For n = 3, 
P3,i = Wi{l — W2){1 — W3), p3^2 = ^2(1 — W3), p3,3 = W3. By contrast, in a Dirichlet 
process each piece of the unitary stick is defined from what is left by the previous ones. 



4. A BETA-GOS HIERARCHICAL MODEL 
In many applications, Bayesian Nonparametrics priors are used in hierarchical models, 
to model the parameters of the sampling distribtuion in a flexible way. In this section, 
we show how the Beta-GOS process could be used as a prior in a hierarchical model, 
and we discuss a straightforward MCMC sampling algorithm for posterior inference. 
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4.1 The hierarchical model. 

Beta-GOS priors can be used to model dependencies between non exchangeable obser- 
vations. Let Y = (Yi, . . . ,Ym)'^ be a vector of observations, e.g. a time series. Then, 
following a Bayesian approach, we can assume that the data can be described by a 
hierarchical model as 

i=l,...,m, (6) 

for some probability density where the vector (^ui, . . . , fim)'^ is a realization of 

a Beta-GOS process with parameters Oi, Pi , i = 1, . . . ,m, and base measure Gq, which 
we succintly denote as 

Beta-GOS(a^,/3„,Go), (7) 

i.e. is a sample from a random distribution characterized by the predictive rule (3), for 
some Wi ~ Beta{ai, fii), i = 1, . . . ,m. As noted in section 2, any Beta-GOS dehnes 
a CID sequence. In particular, marginally Hi ~ Go, i = 1, . . . ,m. Therefore, Gq can 
be regarded as a centering distribution, as in DP mixture models: Gq can represent a 
vague parametric prior assumption on the distribution of the parameters of interest. 
We conclude this section by noting that the sequence Yi, I2, • • defined through (6) 
and (7), with joint density 

/m 
Y[p{yi,\fii)7r{dfii,. . . ,dnm), m > 0, 
i=i 

and 7r(-) = Beta-GOS(am, /9m, Gq), is also a CID sequence. Therefore, although 
not exchangeable, the 1^+j's, j > 1 are conditionally identically distributed given 
(Yi, . . . , Yn, fii, . . . , fin)- For a proof of this statement, see Proposition 4 in the Ap- 
pendix. 
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4.2 MCMC posterior sampling. 

Posterior inference for tlie model (6)- (7) entails learning about the vector of random 
effects m and their clustering structures. As the posterior is not available in closed 
form, we need to revert to MCMC sampling. In this section, we describe a Gibbs 
Sampler that relies on sampling the subsequent cluster assignments of the observa- 
tions Yi, . . . ,Ym according to the rule (3). To do this, the partition structure will be 
described by introducing a sequence of labels {Cn)n>i recording the pairing of each 
observation according to (3), i.e. which other data point, among those with index 
j < i, the ith. observation has been matched to. Hence, here the label Cj is not a 
simple indicator of the cluster membership, as it is typical in most MCMC algorithms 
devised for the Dirichlet process, although cluster membership can be easily retrieved 
by analyzing the sequence of parirings. In what follows, Ci will be sometimes referred 
to as the i-th pairing label. In particular, if the i-th observation is not paired to any of 
those preceding, Ci = i; in this case, the i-th point consists in a draw from the base dis- 
tribution Go, and thus generates a new cluster. This slightly different representation of 
data points in terms of data-pairing labels, instead of cluster-assignment labels, turns 
useful to develop an MCMC sampling scheme for non-exchangeable processes, as it has 
been thouroughly discussed in Blei and Frazier (2011), who have shown that such rep- 
resentation allows for larger moves in the state space of the posterior and faster mixing 
of the sampler. It is easy to see that the pairing sequence (C„)„>i assigns Ci = 1 and 
has distribution 

P{Cn = i\Ci,...,Cn-l,W}= P{Cn = i\Wi, . . . , Wn-l} 

(8) 

= r„_il{i = n} + pn-i,il{i / n}, 

for i = 1, . . . ,n, where II(-) denotes, as usual, the indicator function, such that, given 
a set A, I{i £ A) = I i & A and otherwise. As mentioned, the clustering config- 
uration is a by-product of the representation in terms of data-pairing labels. If two 
observations are connected by a sequence of interim pairings, then they are in the 
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same cluster. Given C = (Ci, . . . , Cm, • • • )) l^t n(C) denote the partition on N gen- 
erated by C. Accordingly, if (/^^)fc>i is a sequence of independent random variables 
with common distribution Go, we set = ^u^ if i belongs to n(C)fc, i.e. the k- 
th block (cluster) of n(C). For any m and any i < m, let C(m) = (Ci,...,Cm), 
C-i = (Ci,...,Cj_i,Cj+i,...,Cm); analogously, let W{m) = {Wi,. . . ,Wm), and 
W-i = {Wi, . . . , Wi-i, Wi+i, . . . , Wm)- Then, the full conditional for the pairing indi- 
cators Cj's is 

P{Ci = j|C_i,y(m), W{m)} oc P{Ci = j, Y{m)\C^i, W{m)} 

(9) 

= P{Yim)\Ci = j, C-i, Wim)}P{Ci = j\C-^, W{m)}. 

The second term in (9) is the prior predictive rule (8), whereas 

|n(c_„j)| 

P{Yim)\Q=j,C.„Wim)}= U U P(^l/^PGo(d/x*), 

k=i «en(c_,j\, 

where n(C_j, j) denotes the partition generated by (Ci, . . . , Cj_i, j, Cj+i, . . . , Cm)- If 
Go and p{y\fJ-) are conjugate, the latter integral has a closed form solution. The non- 
conjugate case could be handled by appropriately adapting the algorithms of MacEach- 
ern and Miiller (1998) and Neal (2000). Instead, we believe that split and merge moves 
as the ones considered in Jain and Neal (2007) and Dahl (2005) are more problematic 
to implement given the implied exchangeability of the clustering assignments in those 
algorithms. As far as the full conditional for the latent variables Wj's, we can show that 
W^\C{m),W.i,Y{m) ~ Beta{Ai,Bi), where A = ai + ET=i+iHCj < i or Cj = j}, 
and Bi = I3i + X^JLi ^{Cj = i}; hence, they depend on only on the clustering configu- 
rations and not on the values of W-i. 

Finally, consider the set of cluster centroids /x*'s. The algorithm described so far allows 
faster mixing of the chain by integrating over the distribution of the n*. However, in 
case we were interested on inference on the vector (^ui, . . . , fim), it's possible to sample 
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the unique values at each iteration of the Gibbs sampler, from 

P{^^*\C{m),W{m),Y{m)]^ J] P(>il/^,*)Go(d/" *), (lo) 

ienj(m) 

where Iij{m) denotes the partition set of the observations such that /ij = /x^, i = 
1, . . . ,m. Again, if p(y|/x) and Go are conjugate, the full conditional of /U* is available 
in closed form, otherwise we can update ^* by standard Metropolis Hastings algorithms 
(Neal, 2000). 

5. A SIMULATION STUDY 
In this section, we provide a full specification for model (6)-(7) and test the properties 
of the Beta-Gos prior on a set of simulated examples; more specifically, we develop 
some comparison with the Dirichlet Process and popular Hidden Markov Models. 

5.1 Model specifications 

Throughout this section, model (6)-(7) will be specified as follows. First, we assume a 
gaussian distribution for the observables, Yi ~ N{fj,i, r^). The base measure Go is also 
assumed normal, N{ij,q,cjq), and Inv-Ga(ao, 6o). The parameters of the latent 

Beta reinforcements, Wi ~ Beta {ai,f3i), are separately indicated in each simulation 
and are chosen to allow for a range of prior beliefs on the clustering behavior of the 
process (see Section 3). Details of the MCMC algorithm for posterior inference and 
parameter estimation in the Beta-GOS model are given in Appendix A. 

5.2 Parameter estimation and robustness to model mis-specification 

We use a set of simulated datasets to assess the inferential properties of the Beta-GOS 
model, and provide a comparison with popularly used mixtures of DPs. 

A first simulation study considers an ideal setting. We generate 1000 samples of 101 
observations each from the Beta-Gos model (6)-(7), where we set = n, jSn = 1. The 
first 100 points are used for fitting purpose, whereas the 101st point is used to assess 
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goodness of fit. Without much loss of generahty, we fix /ig = and ctq = 10, r = 0.25 in 
order to distinguish the sample variability from the variability of the base measure. We 
fit the data with a Beta-Gos hierarchical model, with ao = 10, Beta hyper-parameters 
On = n, Pn = 1, and ~ Inv-Ga(2. 004, 0.0063). This choice of the Inverse-Gamma 
hyper-parameters allows to have mean around 0.25^ and relatively large variance. In 
addition, we fit a mixture of DP model with concentration parameter 6 = 1, which on 
the basis of Proposition 1 (a) can be seen as compatible with the parameters used in our 
model. The mixture of DP model is implemented using the R package "DPpackage" 
(Jara A. , 2007). 

The results of this simple simulation study are summarized in the first columns of 
Table 1. Table 1 reports four summary statistics aimed at providing synthetic measures 
of the goodness of fit; namely, number of clusters, accuracy of cluster assignments, 
predicive bias and estimates of the sample variability. Following the machine learning 
terminology for classification performance metrics, we call accuracy the ratio of the 
correct cluster assignments with respect to the total of assignments. We compute the 
predictive bias as follows: for each sample, and each MCMC output, we predict a new 
observation on the basis of the estimated parameters and the clustering configurations 
provided by the algorithm, say l^red- The prediction is compared with the original 
value, Yioi- The predictive bias is simply the average of |lioi — Ypredl, and can be 
regarded as a measure of how well the model can predict future observations. For 
the Beta-GOS data, nearly 96% of all data points were assigned to correct clusters. 
Most of the error is intrinsic to the data generating process. Indeed, as typical of 
most bayesian nonparametric models (including the DP), the ability of the model and 
estimation algorithms to recover the ground truth may be affected by the choice of the 
relative magnitudes of the hyper-parameters and r^. With respect to the measure 
of the predictive bias, though, the relatively poor performance of the mixture of DP 
suggests that here the exchangeable model is indeed unable to take fully advandage of 
the underlying structure of the data generating process. 
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A second simulation study is designed to assess the robustness of the Beta-GOS 
framework to model mis-specifications. More specifically, we first generate 1,000 data 
sets (101 observations each) from a Normal mixture model with five components, where 
the components are Normal with standard deviation r = 0.25 and means sampled from 
a N{fiQ = 0, ctq = 10). The vector of mixture components' weights is chosen at tt = 
(0.2,0.35,0.15,0.1,0.2)^. In addition, we generate 1,000 datasets from a "truncated" 
Polya Urn model, i.e. a process characterized by a predictive distribution similar to the 
Dirichlet Process, except that for a given lag i > k, we set pn,n-i = 0. Therefore, the 
"truncated" Polya Urn describes a situation where a cluster atom cannot be sampled 
again if none of the last k observations has been assigned to that cluster. In contrast, 
in the Beta-Gos (and MDP) model, there's always a positive probability to re-assign an 
observation to a cluster that has not been recently observed, although this probability 
may depend on the WiS. To fully specify the restricted Polya Urn model, we assume 
that its "base measure" is A^(/io = 0, do = 10), the lag is A; = 3, and set r„ = g, pn^n = f 
and Pn.n-i = Pn,n-2 = FoT both data generating mechanisms, we fit a Dirichlet 
Process (6 = 1), and a Beta-Gos process, with a) a„ = /3„ = 1, and b) a„ = n, (3n = 1- 
Case (a) corresponds to a process with short autocorrelation expected a priori and, 
asymptotically, a finite number of clusters, whereas case (b) assumes that the rescaled 
number of clusters, -fir„/ log(n), converges to a Gamma{l, 1), and E[Kn] ~ log(n). The 
results of the simulations are shown in Table 1. Overall, the Beta-GOS framework is 
quite robust to model mis-specifications. For the mixture of Gaussians data, accuracy of 
cluster assignments was high (91%) and comparable to that of the DP; correspondingly, 
parameters' estimates were close to the true parameter values. In comparison with 
the Dirichlet Process, the Beta-GOS generally exhibits a smaller predictive bias. The 
advantage of the Beta-GOS process seems to be more evident when the true generation 
mechanism is non-exchangeable, as with the "trunctated" Polya Urn. In such case, the 
Beta-GOS seems to lead to more precise estimation of the parameters and cluster 
assignments as well as slightly better predictions than the DP. 
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Finally, we note that in our simulations, posterior inference for the Beta-GOS process 
seemed only minimally affected by the two different specifications of the parameters 
of the Beta weights. This consideration seem to confirm the suggestion that using 
an = n-\-9 — 1, f3n = l may represent a default choice in many applications, where 9 
can be chosen or estimated similarly as what is routinely done for mixture of DPs. 

5.3 Fitting mixtures of hidden Markov models 

In this section, we consider non-exchangeable sequences generated from a mixture 
of two hidden Markov processes. More specifically, we generate 1,000 datasets (100 
observations each) using hidden Markov processes with four states. For each data set, 
the transition matrix for the first 50 observations is assumed to be "sticky", i.e. there's 
a high probability to remain in the same state. In particular, we set a transition matrix 
with 0.91 in the main diagonal and 0.03 elsewhere. The remaining 50 observations are 
generated from a transition matrix with columns: (0.4,0.4,0.1,0.1); (0.4,0.4,0.1,0.1); 
(0.1,0.1,0.4,0.4); (0.1,0.1,0.4,0.4); i.e., it describes a process with frequent changes of 
states. We fit the data by means of a Beta-GOS model with Beta hyper-parameters 
defined by /?„ = 1 and, respectively, a) = "-; b) a„ = n -|- 4; c) a„ = 1; d) = 4; 
e) a„ = 9. We then compare the Beta-GOS with the fit resulting from a single hidden 
Markov model. Teh et al. (2006a), Fox et al. (2011), and Yau et al. (2011) have 
recently developed flexibile and effective hierarchical Bayesian nonparametric hidden 
Markov models that allow posterior inference over the number of states when these are 
unknown. The Beta-GOS model provides an alternative, non-exchangeable, Bayesian 
Nonparametric formalism to tackle the same type of problems. Since the Beta-GOS 
model does not rely on the estimation of a single transition matrix across time points, 
we do not neet to consider an explicit parameter to account for state persistence, as in 
Fox et al. (2011). Indeed, the species sampling mechanism that defines the Beta-GOS 
model should ensure that the sampling weights are able to adapt to possible changes in 
the underlying properties of the process. Hence, the Beta-GOS is particularly adequate 
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to describe situations where the transition matrix of a HMM may be assumed to 
vary with time, as in the examples described here. Monteiro et al. (2011) tackle a 
similar problem in a product partion model framework and explicitly assume that the 
observations in a cluster have their distributions indexed by different parameters. Our 
approach is different, for example we do not need to explicitly model the dependence 
structure within the clusters. Results from these simulations are reported in Table 2, 
where for comparison and computational simplicity, we implemented a HMM using 
the R package "RHmm" (Taramasco and Bauer, 2012), setting the number of clusters 
to be four, five or six. Table 2 shows that the Beta-GOS is a viable alternative to 
HMM, as it can provide a more accurate clustering assignment as well as more precise 
parameter estimates than single hidden Markov model with a fixed number of states. 

6. QUANTIFYING CHROMOSOMAL ABERRATIONS IN BREAST 

CANCER 

In this section, we apply the Beta-Gos to a publicly available dataset (Chin et al., 
2006) that has been used to link patterns of chromosomal aberrations to breast can- 
cer pathophysiologies in the medical literature. The raw data measure genome copy 
number gains and losses over 145 primary breast tumor samples, across the 23 chromo- 
somes, obtained using BAG array Gomparative Genomic Hybridization (GGH). More 
precisely, the measurements consist of log2 intensity ratios obtained from the compar- 
ison of cancer and normal female genomic DNA labeled with distinct fiuorescent dyes 
and CO- hybridized on a microarray in the presence of Got-1 DNA to suppress unspecific 
hybridization of repeat sequences (see Redon et al., 2009). The analysis of array GGH 
data presents some challenges, because data are typically very noisy and spatially cor- 
related. More specifically, copy numbers gains or losses at a region are often associated 
to an increased probability of gains and losses at a neighboring region. We use the 
Beta-GOS model developed in the previous sections to analyze and cluster clones with 
similar level of amplification/deletion, for each breast tumor sample and each chromo- 
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some in the dataset. More specifically, for the data analysis we considered a default 
specification of the latent Beta hyperparameters, setting a„ = n and = 1, a vague 
base distribution, A^(0, 10) and a vague inverse gamma distribution centered around 
T = 0.1. This choice is motivated by the typical scale of the array CGH data and is in 
accordance with similar choices in the literature (see, for example Gulia et al., 2008). 

Figure 1 exemplifies the fit to chromosome 8 on two tumor samples. The model 
is able to identify regions of reduced copy number variation and high amplification. 
Note how contiguous clones tend to be clustered together, in a pattern typical of 
these chromosomal aberrations. Figure 2 replicates Figure 1 in Chin et al. (2006) 
and shows the frequencies of genome copy number gains and losses among all 145 
samples plotted as a function of genome location. In order to identify a copy number 
aberration for this plot, for each chromosome and sample, at each iteration we consider 
the cluster with lowest absolute mean and order the other clusters accordingly. The 
lowest absolute mean is chosen to identify the copy neutral state. Following Guha et al. 
(2008) any other cluster is identified as a copy number gain or loss if its mean, say 
is farther than a specified threshold from the minimum absolute mean, say i-^- if 
^(i) ~ /^(i) ^ ^' experimented with a range of choices of e in the range [0.05,0.15] 
and used e = 0.1 for the current analysis. Furthermore, if the mean of a cluster is 
above the mean of all declared gains plus two standard deviations, all genes in that 
cluster are considered high level amplifications. We identify a clone with an aberration 
(or high level amplification) if it is such in more than 70% of the MCMC iterations; 
then, we compute the frequency of aberrations and high level amplifications among all 
145 samples, which are reported, respectively, at the top and bottom of Figure 2. As 
expected, the clusters identified by the model tend to be localized in space all over 
the genome. This feature may be facilitated by the increasingly low reinforcement of 
far away clones embedded in the Beta-GOS, and corresponds to the understanding 
that clones that live at adjacent locations on a chromosome can be either amplified or 
deleted together due to the recombination process. 
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Figure 1: Model fit overview: Array CGH gains and losses on chromosome 8 for two samples 
of breast tumors in the dataset in (Chin et al., 2006). Points with different shapes denote 
different clusters. 
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Figure 2: A) Frequencies of genome copy number gains and losses plotted as a function of 
genomic location. B) Frequency of tumors showing high-level amplification. The dashed 
vertical lines separate the 23 chromosomes. 
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Finally, we considered some regions of chromosomes 8, 11, 17, and 20 that have been 
identified by Chin et al. (2006) and shown to correlate in their analysis to increased 
gene expression. We adapt the procedure described in Newton et al. (2004) to compute 
a region-based measure of the false discovery rate (FDR) and determine the (/-values 
for neutral-state and aberration regions estimated in our analysis. The g-value is the 
FDR analogue of the p-value, as it measures the minimum FDR threshold at which 
we may determine that a region corresponds to significant copy number gains or losses 
(Storey, 2003; Storey et al., 2007). More specifically, after conducting a clone based 
test as described in the previous paragraph, we identify regions of interest by taking 
into account the strings of consecutive calls. These regions then costitute the units 
of the subsequent cluster based FDR analysis. Alternatively, the regions of interest 
could be prespecified on the basis of the information available in the literature. The 
optimality of the type of procedures here described for cluster based FDR is discussed 
in Sun et. al, 2012. See also Heller et al., 2006, Miiller et al., 2007 and Ji et al., 
2008). In Table 3 we report the g- values from a set of candidate oncogenes in well- 
known regions of recurrent amplification (notably, 8pl2, 8q24, llql3-14, 12ql3-14, 
17q21-24, and 20ql3). Our findings confirm the previous detections of chromosomal 
aberrations in the same locations, suggesting that the Beta-Gos model could be used 
in the analysis to complement tumor sub-type definition, or suggest candidate genes 
with similar aberration patterns for follow-up clinical studies. 

7. CONCLUDING REMARKS. 
We have considered the class of Generalized Ottawa Sequences as a way to define a 
non-exchangeable random partition, starting from the characterization of a Species 
Sampling prior in terms of its predictive probability functions. More precisely, we have 
considered predictive rules where the weights are functions of latent Beta random vari- 
ables. We have discussed the clustering behavior of the resulting Beta-GOS processes 
for some specifications of the latent Beta densities and illustrated their use as priors 
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Table 3: False discovery rate analysis for clones with high-level amplification previously 
identified by Chin et al. (2006). The individual amplicons are reported together with the 
locations of the fianking clones on the array platform. 



Amplicon 


Flanking clone 


Flanking clone 


Kb 


Kb 


FDR 




(left) 


(right) 


start 


end 


q- value 


8pll-12 


RP11-258M15 


RP11-73M19 


33579 


43001 


0.043 


8q24 


RP11-65D17 


RP11-94M13 


127186 


132829 


0.033 


llql3-14 


CTD-2080I19 


RP11-256P19 


68482 


71659 


0.030 


llql3-14 


RP11-102M18 


RP11-215H8 


73337 


78686 


0.040 


12ql3-14 


BAL12B2624 


RP11-92P22 


67191 


74053 


0.010 


17qll-12 


RPll-5808 


RP11-87N6 


34027 


38681 


0.021 


17q21-24 


RP11-234J24 


RP11-84E24 


45775 


70598 


0.021 


20ql3 


RMC20B4135 


RP11-278I13 


51669 


53455 


0.032 


20ql3 


GS-32I19 


RP11-94A18 


55630 


59444 


0.030 



in a hierarchical model setting. Finally, we have discussed the performance of this 
modeling framework by means of a set of simulation studies and an application to the 
detection of chromosomal aberrations in breast cancer using CGH data. 
With respect to other proposals, the Beta-GOS provides a way to model heterogene- 
ity across non-exchangeable observations that are sequentially ordered, by enabling 
clustering in a number of unknown states. Furthermore, since the predictive weights 
depend on the sequence of observations itself, the Beta-GOS seems particularly con- 
venient when the underlying generative process is non-stationary, e.g. as a possible 
alternative to more complicated nonhomogeonous HMMs. 

Arguably, the major obstacle we can foresee in the wider applicability of this type 
of models relies in the specification of the prior hyperparameters in the latent Beta 
distributions. However, our experience suggests that the default choice of the hyper- 
parameters outlined in Proposition 1(a) not only reduces the problem to the choice of 
a single parameter as it is usual in DP mixture models, but may also suffice in most 
situations. Nevertheless, the choice or sampling of the hyperparameters of the Beta 
latent variables in applications requires further study. 

Finally, we believe that the flexibility of the latent specification and the possibility 
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to tie the clustering implied by the Generalized Polya Urn scheme directly to a set 
of latent random variables gives an opportunity to further modeling the complex re- 
lationships typical of heterogenous datasets. For example, further developments may 
substitute the general latent Beta specification with a probit/logistic specification, and 
define a Generalized Polya Urn scheme in the aims of Rodriguez et al. (2010) that 
allows the clustering at each observation to be dependent on a set of (possibly se- 
quentially recorded) covariates or curves. Similarly, we can imagine using multivariate 
Generalized Polya Urn schemes of the sort we describe in this paper to model time 
dependent parameters in time series, which may be important to identify time-varying 
structures or regime changes at the base of phenomena like the so called financial con- 
tagion, i.e. the co-movement of asset prices across global markets after large shocks 
(see, for example, Liu et. al, 2012). 
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A. APPENDIX: DETAILS OF POSTERIOR MCMC SAMPLING FOR 



Here, we provide the details of the MCMC sampling algorithm described in section 
4.2 for the special case of a Normal sampling distribution and a Normal (or Normal- 
Inverse-Gamma) base measure. 

A.l Full conditionals for the Gibbs sampler 

At each iteration of Gibbs sampler we sample from the full conditionals of C„ and Wn, 
for n = 1,. . . ,N. Here we derive the analytical form of these distributions, for the 
Beta-GOS model specified in Section 5. Recall that the full conditional distribution 



THE BETA-GOS MODEL 



for Cn is 



P{Cn = i 



C^n,W{N),Y{N),T^} 



oc P{Y{N)\C, 



i, C7_„,, W{N), T^} ■ P{Cn = i\C^n, W{N)}, 
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where the factor on the right is given by (8) and (3), and the left factor is obtained by 
integration, 




j=i I en. 



J 



1 1 



exp 




2r2 



} 




1 



where Hj is the set of indices of data points in cluster j, and J is the number of 
clusters at that iteration. Note that the latent reinforcements W{N) are used to define 
the cluster assignments through the data-pairing labels C{N). Conditionally on the 
data-pairing labels C{N), the data Y{N) is independent of the latent reinforcements 



The fuU conditional for W„, denoted by P{Wn\C{N),W-n,Y{N)), is Beta dis- 
tributed with updated parameters An,Bn, defined as in (8). 

A. 2 Inference on the cluster centroids of the Beta-GOS process. 
For the purpose of computational efficiency, it's generally preferrable to sample the 
random partitions integrating out with respect to the parameters of the Beta-GOS 
process, as described in Section 4.2 and in Appendix A.l. If the sampling distribution 
and the base measure are conjugate, this usually results in improved mixing of the 
chain. However, in many cases, it may be required to draw inferences on the cluster 
centroids themselves. As usual with mixtures of DP, inference on the cluster centroids 
can be easily conducted (even ex-post) from the clustering configurations at each iter- 
ation. Therefore, we do not have to sample the centroids within each Gibbs iteration, 
but if the need be, we can easily resample them at the end of each iteration, or at the 
end of the sampler from the stored output. 



W{N). 
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A. 3 Inference on the cluster and global variances 

Let the variance of the samphng distribution be r^. We assume t"^ ~ IGamma{ao, bo). 
The posterior distribution of the variance in each cluster j, is given by 

Tj I //*, Fj, i G Uj ~ IGamma ( ao + ^"^^ + ^ (^^ ~ fJ-jf )-: 

Note that, in case of need and for computational efhciency, we could use these also 
quantities to obtain a global estimate for the sampling variance at each iteration, in 
an MCMC-EM step, as = Ylj=i n-j ^ ■ This may turn useful, for example, for 
parallelization purposes, as in the simulations of Section 5. 

A. 4 Inference on the cluster means 

In the normal-normal model described in Section 5, the posterior distribution of /i^ 
given data Yi in the j-th. cluster can be evaluated at each iteration as 



Pifi*j\TlYi,ieUj)^ N 



j 



„2 ^2 \ I 



J 




for j = 1, . . . , J, where Yj is the j-th cluster specific mean. Note that we have assumed 
a common sampling variance ; the modification of the previous formula to take into 
account a cluster specific variance is of course straightforward. 

B. APPENDIX: DETAILS OF THE PROOFS AND ADDITIONAL 
THEORETICAL RESULTS 
B.l Generalized Ottawa Sequence and its moments 

According to Bassetti, Crimaldi and Leisen (2010) a sequence (X„)„>i of random vari- 
ables taking values in a Polish space is a Generalized Ottawa Sequence (GOS) if there 
exists a sequence {Wn)n>i (of random variables) such that the following conditions are 
satisfied: (i) the law of Xi is Gq; (ii) for n > 1, Xn+i and the subsequence {Wn+j)j>i 
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are conditionally independent given the filtration '•= (^{Xi^ . . . ,Xn,Wi, . . . ,Wn); 
(iii) the predictive distribution of Xn+i given Tn is given by (3) where the r„ 's are 
strictly positive functions, rn(Wi, . . . , Wn), of the vector of latent variables, such that 

rn{Wi,...,Wn)>rn+l{Wu...,Wn,Wn+l), (A.l) 

almost surely, with rg = 1, and the weights pn^i = Pn,i(Wi, . . . , Wn) are 

Pn,i = l = l,...,n. (A. 2) 



The predictive distribution (3)-(4) corresponds to choice rn{Wi, . . . , Wn) = YYi=i Wi 
where {Wn)n>i is a sequence of independent random variables. 

We conclude this section by providing a general result for the A;-th moment and for 
the moment generating function of the length Kn of a GOS. Suppose that the sequence 
{Xn)n>i is a GOS, with Gq diffuse, and let Uj = Kj — i^j-i with Kq = 0. Then, 
Kn = J21=i ^^"i joint distribution of Ui, . . . ,Un conditionally on ri, . . . , r„_i, is 



P{Ui = !,...,[/„ = Cnln, . . .,rn~i} = Y].r^l^{l - r^^i 



1— e; 
1=2 



for every vector {e2, ■ ■ ■ , Cn) in {0,1}" ^, since P{Ui = 1) = 1 by definition. Since 
Ki = C/i = 1, it follows that, for every k > 1, 



n+l 

\l-e,; 



p{Kn^, =k+i}=p{Y,u,=k} = E^[n<-i(i 



j=2 ' e " i=l 



where the summation is extended over all sequences e = (ei, . . . , e„) in {0, 1}"" such 
that Y17=i ~ ^- Moreover, for every A; > 1 and n > 2, it is easy to see that 



71+1 , kAn 



E[{Kn+l - 1)''] = e[(^Y.^,^ ] = J^"^!5(^,H0n,m (A.3) 



j=2 m=l 
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where k An = min(A:, n), 



(t>n,m:= Yl Elni---riJ. (A.4) 

and S{k,m) := ^ Z]{„^>o:j]™ ^ ni=fc} ml.lnml ^® Stirling number of second kind. 
Hence, E[{Kn+i — 1)'^] depends recursively on functions (f>n-i,m, m = 1, . . . , k. It may 
be interesting to note that, using the well known relation between factorial moments 
and ordinary moments (see, e.g.. Example 2.3 in Charalambides, 2005), from (A. 3) one 
gets, for any m < n, 

E[{Kn+l - l)(m)] = ml<j)n^rn (A. 5) 

where = t{t — 1) . . . {t — m + 1) is the falling factorial. Moreover, since 

^ , S{k,m) ^ je-'-ir 
' k\ m! ' 

k>m 

see e.g. Thm. 2.3 in Charalambides (2005), it follows that the moment generating 
function of Kn+i is given by 

Mn+i{t) := E[e-^^-+'] = e'^ E[e-^''^-+'-^^] 

/ -f-\k k/\n / -f-\k I 

= E - 1)'] = + E E ^^ir^^(^' "^^^-'^ 

k>0 ' k>lm=l 



n 



m=l k>m 
n 



e * + e * ^ m! (t)n,m ^-^S{k, m) 



k 



m=0 

(A.6) 



with (pnfi ■■= 1- 
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B.2 Proof of Proposition 1 

If we consider equation (A. 4) with (Wi)i>i independent random variables taking values 
in [0, 1], then 

m h 

c^n,m= n n mr'-'i (a-t) 

where Iq ■= 0. We need some preliminary results. 

Lemma 2. // Wi ~ Beta{i + 6 — 1, 1), for given 6 > 0, then 

_ r(9 + m) ^ ^ ^ 1 

^n,m - ^^^^ 1^ 1^ l^--- 1^ (j-^ + 0)( + 0) . . . ( + e) ■ ^^-^^ 

ji=m ]2=m J3=m Jm=m 

In particular, as n goes to +oo, 

E[K] = ^^^^log^r^) [1 + 0(1)]. (A.9) 

Let us start by proving (A. 8). First, note that since Wi is a Beta{i + — 1,1) 
random variable then, for I < j < m, E[W[^^^~-^] = ■ Hence, by (A. 7), 

<t>n,^= E n n -itl'' (^-10) 

l<li<l2<---<l,„<nj=li=lj^i + l 

which, after some algebra, returns (A. 8). In order to prove the second part of Lemma 
2 we need to introduce additional notation. For 6>0, k>l, m>2 and n > k, set 

n 31 32 jm-i I 

Ji=k 

Note that m\(l)n,m = ^m,0i''T',m)T{9 + m)/T{6). For all A; > 1, m > 1 and n > k, set 
(5fc^e(m,n) := ^'^^^(n, m) — log'"(n + 9). Now formula (A.9) in Lemma 2 follows easily 
from (A. 3) and the next result. 
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Lemma 3. For 9 > 0, k > 1 and m > 1, there is a constant Ck.e{m) such that 



\Qk,e{'m, n)\ < Ck,e{m) log"* {n + 9) for every n>k. (A. 11) 



Let k > 1 and ^ > 0. For m > 1 and n> k set 



j=k 



mlog"'-'{j + 9) 

JT~9 ■ 



and 



n) := Sk^eim, n) - log'"(n + (9) = ^ 



" mlo g'^-\j + 9) 



j=k 



log™(n + 0). (A.12) 



We claim that, for any m > 1, there is a constant Cj^ = Cj^ g ^ such that 



|-Rfc,e("i, n)| < Cj^, for all n> k. 



(A.13) 



Now observe that ^'^^^(n, 1) = ^^^^(l,^). Hence, (A.13) proves (A. 11) for m = 1 and 
every k > 1 and 9 > 0. By induction suppose that (A. 11) is true for m = 1, . . . , M — 1. 
Note that, for m > 2, 



V ^ m 

^k,e[n, m) = > . , ^ fc,e(ji,m - 1), 
^ Ji + 9 



ji=k 



hence, by induction hypothesis, for every 9 > 0, k > 1 and n > k, 



M 



ji=k 



^ ji + 9 



Using (A.12) one gets 



M 



^k,e{n,M) = log^'{n + 9) + Rk,e{M,n)+y] ——Qkfi{M-l,h 



ji=k 
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Hence, using (A. 13) and the induction hypothesis, one can write 

" M 

\QkAM,n)\ < \Rk,e{M,n)\ + y] -^\Q,,^g{M - 

< CMfi,k + — -j^^ — 2^ log Oi + 0) 

ji=k 

< Ch,e,k + ^"^'^^f^ilog'^-'in + e) + \R,,{M - l,n)|] 

< CX,,, + ^^gM(M-l) [iogM-i(, + + 

which proves (A. 11) for m = M. To complete the proof let us prove (A. 13). Observe 
that X I—)- x+e'^^'' 1^ ^ non- increasing function on [xq, +oo) for a suitable xq = 
xo{k, 6, m). Assume, without real loss of generality, that k > xq + 1. Note that, in this 
case, 

r'""°^'"''':+'"d.<5„(,„.„)< r "'■°^-'(;+^),,, 

Hence, 

log™(n + 1 + 0) - log'"(A: + 6) < Sk,e{m, n) < log™(n + 9)- log™(A; -1 + 0), 
which gives 

Iog"^(n + 9)- \og^{k + 9)< Sk,e{m, n) < log'"(n + 9), 

and then 

|5fc,e(m, n) - log"^(n + 9)\ < \og^{k + 9). 
Proof of Proposition 1 (a). It follows immediately from (A. 9) and a classical re- 



E 



suit concerning the convergence in distribution when the moments converge. Indeed, 
(i^n) converges to ^Yie)^ that is the A;-th moment of a T{9, 1) random variable. 

Proof of Proposition 1 (b). The first part of the statement of Proposition 1(b) 
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follows from Proposition 2.1 in Bassetti, Crimaldi and Leisen (2010) if one shows that 
^iYl"^! ''i] < -f"™' O!" — ^ Pn = b one gets i?[r„] = a^/{a + 6)" and the thesis 
follows. When an = n + 9 — 1 and /3„ = /3, as explained in Section 3, S[r„] ~ n~'^ 
and the thesis follows since /? > 1. It remains to prove the assertion concerning the 
moment generating function and the factorial moments of i^oo- 
If a,i = a and /3„ = b, (A. 7) becomes 

m 

l<ll<l2<--<lm<n j = l 

m m+l—j 

= E IK n 

l<h<l2<--<lm<nj = l 1 = 1 

since = YYlLi li for 7j = {a + i — l)/{a + h+i — l). Taking the limit for n — +oo, 

we get 



m m+l—j 



Ij Ij _ 1 



iim0„,^= n( n 

l<h<l2<---<lmj = l i = l 

m m+l—j 

= EE-En( n 

fci>lA:2>l km>lj=l i=l 
m m+l—j m 



nE( n 7.)'-nr-,,...,, 

j=lkj>l i=l j=l ' '■> 



and then 



lim (t)n,m = TT 



ii(a + ?,)0-)-(a)(i) 

where (t)^^') = t{t + I) . . . {t + j - I) is the rising factorial. Combining this fact with 
(A. 6) it follows that, in this case, 



E[e-'^^] = e-* - 1)"" n 



m>0 



ii (a + 6)0-) - (a)O-) 



39 



In addition (A.3)-(A.5) give 



{Koo-lU, i\ (a)^^^ ,,,, ^M^ («) 



E\ 



(i) . A ™ rn^(i) 



m! 



J = l ^ ' ^ ' ?Tl = l J = l ^ ' 



B.3 Conditionally identity in distribution of the beta-GOS hierarchical model 
Proposition 4. The sequence {Yn)n defined by formula (6) -(7) is conditionally iden- 
tically distributed with respect to the filtration T-Ln = '^(^(n), VF(n), /x(n)). 

Proof. Let 

Gn = cj{W{n),tM{n)) 
nn = a{W{n),^,{n),Y{n)) 

We have to prove that for every real, bounded and measurable function g 

E{g{Yn+j)\nn) = E{g{Yn+i)\nn) (A.14) 

Now, for every j > 

) (A.15) 

and for every j and n 

C{^n+j I y-n) = ^{fJ-n+j \ Qn) (A.16) 

As already recalled, (/in)n is CID with respect to Qn = a{W{n)., ^{n)). This means 
that for every real, bounded and measurable function / 

E{f{^ln^,)\gn) = E{fif,n+l)\gn) (A.17) 

for ah j > 1, see Berti, Pratelh and Rigo (2004). Thanks to (A.16), equality (A.17) 
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also holds with respect the sigma- field T-Ln- Indeed, 

EU{^Jin+J)\'Hn) = E{f{fl„.+j)\Gn) = E{f {fln+l)\Gn) = E{f {fl„.+ l)\nn) 

(A. 15) implies that 

E{g(Yn+j)\'Hn, fJ-n+j) = E{g{Yn+j)\Hn+j) = j g{y)p{dy I /i„+j) (A 

(A. 17) and (A. 18) allow to prove the thesis. Indeed, 

E{g{Yn+j)\nn) = E{E{g{Yn+j)\'Hn,fin+j\'Hn) = E{ j g{y)p{dy \ ^ln+j)\'Hn) 

= E{j g{y)p{dy \ /X„+l)|^n) = E{EigiYn+i)\'Hn, f^n+i\nn) 
= E{g{Yn+i)\nn) 
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